Project Overview

This report has been prepared for the edX HarvardX: PH125.9x capstone requirement and the data is available from Kaggle here.

The data consists of 7043 unique customer values, with 21 columns and an overall target “Churn” column.

My objective in this project is to initially analyse and provide insights from the data and ultimately use this to train a Churn prediction model which could be used for busines purposes.

## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Wrangling

Note there are a small number of missing values in the ‘TotalCharges’ observations. My decision here is to predict a mean value for this missing observation and use all available data, rather than ignore this information completely.

It appears the SeniorCitizen observations have unique values of either 0 or 1 and are actually a factor and these formats will be coerced.

## [1] 0 1

Expolration

Now that the dataset is complete let’s gather some some insights through exploratory data analysis (EDA).

The aim here is to understand key insights for why customers may be churning to identify trends, which can provide a basis for future strategy.

Feature Importance

Now that we have some initial eda lets use machine learning to understand what features are the most predictive for churn and which are not. Boruta uses a powerful randomforest algorithm to calculate importance.

## Boruta performed 99 iterations in 10.22079 mins.
##  17 attributes confirmed important: Contract, Dependents,
## DeviceProtection, InternetService, MonthlyCharges and 12 more;
##  1 attributes confirmed unimportant: gender;
##  1 tentative attributes left: PhoneService;

##  [1] "SeniorCitizen"    "Partner"          "Dependents"       "tenure"          
##  [5] "MultipleLines"    "InternetService"  "OnlineSecurity"   "OnlineBackup"    
##  [9] "DeviceProtection" "TechSupport"      "StreamingTV"      "StreamingMovies" 
## [13] "Contract"         "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"  
## [17] "TotalCharges"

Modelling

Now we will create training and test sets, with a caret powered randomforest and an automated machine learning package, h2o to compare results. From a business case perspective, an accurate predictive model could be used to fire targeted offers at customers with predicted churn.

## Confusion Matrix and Statistics
## 
##        
## rf_pred  No Yes
##     No  484 108
##     Yes  34  79
##                                          
##                Accuracy : 0.7986         
##                  95% CI : (0.767, 0.8276)
##     No Information Rate : 0.7348         
##     P-Value [Acc > NIR] : 4.817e-05      
##                                          
##                   Kappa : 0.4085         
##                                          
##  Mcnemar's Test P-Value : 9.010e-10      
##                                          
##             Sensitivity : 0.9344         
##             Specificity : 0.4225         
##          Pos Pred Value : 0.8176         
##          Neg Pred Value : 0.6991         
##              Prevalence : 0.7348         
##          Detection Rate : 0.6865         
##    Detection Prevalence : 0.8397         
##       Balanced Accuracy : 0.6784         
##                                          
##        'Positive' Class : No             
## 

Automated Machine Learning (AML)

Now we’ll use an automated alogorithm that searches for the best fit model including stacked ensembles and compare results.

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         7 hours 15 minutes 
##     H2O cluster timezone:       Australia/Brisbane 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.1 
##     H2O cluster version age:    4 months and 25 days !!! 
##     H2O cluster name:           H2O_started_from_R_bmr057_xik111 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.17 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.2 (2019-12-12)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 25 days)!
## Please download and install the latest version from http://h2o.ai/download/
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
## 22:30:23.748: AutoML: XGBoost is not available; skipping it.
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.365926491151333:
##         No Yes    Error      Rate
## No     438  80 0.154440   =80/518
## Yes     53 134 0.283422   =53/187
## Totals 491 214 0.188652  =133/705
method accuracy
randomforest 0.7985816
aml 0.8850385

Conclusion

It is possible to predict churn for this business with a reasonable level of accuracy and the recommendation would be to build a business case to understand the benefit of targeted promotions for churn predicted customers vs the cost of activation. The promotional cost of such an activity could be reduced by targeting customers with a higher probability of churning.

Thank you for grading this assignment and to the edX team who have created this training content. By taking the data science track I have found some new powerful skills which are a real asset for myself and any prospective employer and these skills are already being utilised.

It is clear that AML processes are very powerful also, though I could not have realised this, without having learned the science and programming background required for other methods - hard to believe such powerful tools are open source and freely available.